Abstract. Probability estimation is essential for every statistical data compression algorithm.
In practice probability estimation should be adaptive, i.e. recent observations should receive
a higher weight than older observations. We present a probability estimation method based on
exponential smoothing that satisfies this requirement and runs in constant time per letter. Our
main contribution is a theoretical analysis in case of a binary alphabet for various smoothing
rate sequences: We show that the redundancy w.r.t. a piecewise stationary model with $s$
segments is $O(s\sqrt{n})$ for any bit sequence of length $n$, an improvement over redundancy
$O(s\sqrt{n \log n})$ of previous approaches with similar time complexity.

1 Introduction 

Background. Sequential probability assignment is an elementary component of every
statistical data compression algorithm, such as Prediction by Partial Matching, Context Tree
Weighting and PAQ (“pack”). Statistical compression algorithms split compression into 
modeling and coding and process an input sequence letter-by-letter. During modeling a 
model computes a distribution p and during coding an encoder maps the next letter x, given 
p, to a codeword of a length close to — logp(x) bits (this is the ideal code length). Decoding 
is the reverse: Given p and the codeword the decoder restores x. Arithmetic Coding is the 
de facto standard en-/decoder, it closely approximates the ideal code length [jll . All of the 
mentioned compression algorithms require simple, elementary models to predict a probability
distribution. Elementary models are typically based on simple closed-form expressions,
such as relative letter frequencies. Nevertheless, elementary models have a big impact on both
theoretical guarantees [|3 IHl and empirical performance [iH [TOl of statistical compression 
algorithms. (Commonly, we express theoretical guarantees on a model by the amount of bits 
the model requires above an ideal competing scheme assuming ideal encoding, the so-called 
redundancy.) It is wise to choose elementary models carefully and desirable to analyze them 
theoretically and to study them experimentally. In this work we focus on elementary models 
with the ability to adapt to changing statistics (see next paragraph) whose implementation 
meets practical requirements, that is $O(Nn)$ (arithmetic) operations and $O(N)$ data words
(holding e. g. a counter or a rational number) while processing a sequence of length n over an 
alphabet of size N. 

Previous Work. Relative frequency-based elementary models, such as the Laplace- and
KT-Estimator, are well-known and well-understood [HI . A major drawback of these classical
techniques is that they don't exploit recency effects (adaptivity): For an accurate prediction
novel observations are of higher importance than past observations [3, 7]. From a theoretical
point of view adaptivity is evident in low redundancy w.r.t. an adaptive competing scheme
such as a Piecewise Stationary Model (PWS). A PWS partitions a sequence of length $n$
arbitrarily into $s$ segments and predicts an arbitrary fixed probability distribution for every
letter within a segment. (Since both segmentation and prediction within a segment are arbitrary,
we may assume both to be optimal.)

To lift the limitation of classical relative frequency-based elementary models we typically
age observation counts; aging takes place immediately before incrementing the count of a
novel letter. For aging frequency-based elementary models there exist two major strategies,
which are heavily used in practice. In Strategy 1 (count rescaling) we divide all counts by a
factor in well-defined intervals (e.g. when the sum of all counts exceeds a threshold) [2]; for
Strategy 2 (count smoothing) we multiply all counts by a factor in $(0,1)$ in every update Q-.
Strategy 1 was analyzed in [5] and has redundancy $O(s\sqrt{n \log n})$. Similarly, a KT-estimator
which completely discards all counts periodically was analyzed in flSl and has redundancy
$O(s\sqrt{n \log n})$. Strategy 2 was studied mainly experimentally [171191 [T^.

Another approach for adaptive probability estimation, commonly used in PAQ, is smoothing
of probabilities, Strategy 3. Given a probability distribution (i.e. the prediction of the
previous step) and a novel letter we carry out an update as follows: First we multiply all
probabilities with smoothing rate $\alpha \in (0,1)$ and afterwards we increment the probability
of the novel letter by $1 - \alpha$. Smoothing rate $\alpha$ does not vary from step to step. To our
knowledge this common-sense approach was first mentioned in (Sll . A finite state machine that
approximates smoothing was analyzed in [0 and has redundancy $O(nK^{-2/3})$ w.r.t. PWS with
$s = 1$, where $K$ is the number of states.

All aforementioned approaches meet practical demands: they require $O(Nn)$ (arithmetic)
operations and $O(N)$ data words. More complex (but impractical) methods are based on
mixtures over elementary models associated to so-called transition diagrams [Elllll or associated
to (PWS-)partitions [10].

Our Contribution. In this work we analyze a generalization of strategies 2 and 3 for a binary
alphabet. Based on mild assumptions on sequence $\alpha_1, \alpha_2, \ldots$ of smoothing rates
($\alpha_k$ is used for an update after observing the $k$-th letter) we explicitly identify an input
sequence with maximum redundancy w.r.t. PWS with $s = 1$ and subsequently derive redundancy
bounds for $s > 1$ (Section 3). For PWS with arbitrary $s$ we give redundancy bounds for three
choices of smoothing rates in Section 4. First, we consider a fixed smoothing rate
$\alpha = \alpha_1 = \alpha_2 = \ldots$ (as in PAQ) and provide $\alpha^*(n)$ that guarantees redundancy
$O(s\sqrt{n})$ for a sequence of length $n$; second, we propose a varying smoothing rate, where
$\alpha_k \approx \alpha^*(k)$; and finally a varying smoothing rate that is equivalent to Strategy 2
from the previous section. By tuning parameters we obtain redundancy $O(s\sqrt{n})$ for all
smoothing rate choices, an improvement over redundancy guarantees known so far for models
requiring $O(Nn)$ (arithmetic) operations per input sequence. Section 5 supports our bounds with
a small experimental study and finally Section 6 summarizes and evaluates our results and gives
perspectives for future research.

2 Preliminaries 

Sequences. We use $x_{i:j}$ to denote a sequence $x_i x_{i+1} \ldots x_j$ of objects (numbers,
letters, ...). Unless stated differently, sequences are bit sequences (have letters $\{0, 1\}$).
If $i > j$, then $x_{i:j} := \phi$, where $\phi$ is the empty sequence; if $j = \infty$, then
$x_{i:j} = x_i x_{i+1} \ldots$ has infinite length. For sequence $x_{1:n}$ define
$x_{<i} := x_{1:i-1}$ and $x_{\le i} := x_{1:i}$; we call $x_{1:n}$ deterministic, if
$x_1 = \ldots = x_n$, and non-deterministic, otherwise.

Code Length and Entropy. Code length is measured in bits, thus $\log := \log_2$. For
probability distribution $p$ over $\{0,1\}$ and letter $x$ we define $\ell(x; p) := -\log p(x)$.
The binary entropy function is denoted as $H(q) := -q \log(q) - (1-q) \log(1-q)$, for a
probability $q$. For sequence $x_{1:n}$ and relative frequency $q$ of a 1-bit in $x_{1:n}$ let
$h(x_{1:n}) := nH(q)$ be the empirical entropy of $x_{1:n}$.
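The two entropy quantities defined above can be sketched in a few lines of Python. This is only an illustration of the definitions; the function names `binary_entropy` and `empirical_entropy` are chosen here and do not appear in the paper.

```python
import math


def binary_entropy(q: float) -> float:
    """H(q) in bits, using the usual convention 0 * log(0) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def empirical_entropy(bits) -> float:
    """h(x_{1:n}) = n * H(q), where q is the relative frequency of 1-bits."""
    n = len(bits)
    if n == 0:
        return 0.0
    return n * binary_entropy(sum(bits) / n)


if __name__ == "__main__":
    print(empirical_entropy([0, 1]))        # one 0-bit and one 1-bit
    print(empirical_entropy([0, 0, 0, 0]))  # a deterministic sequence
```

A deterministic sequence has empirical entropy 0, which is why the deterministic and non-deterministic cases are treated separately in the analysis.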

Partitions and Sets. Calligraphic letters denote sets. A partition of a non-empty segment
(interval) $(a, b]$ of integers is a set of non-overlapping segments
$(i_0, i_1], \ldots, (i_{n-1}, i_n]$ s.t. $a = i_0 < i_1 < \cdots < i_n = b$. The phrase
$k$-th segment uniquely refers to $(i_{k-1}, i_k]$.

Models and Exponential Smoothing of Probabilities. We first characterize the term model
from statistical data compression, in order to define our modeling method. A model MDL maps
a sequence $x_{1:n}$ of length $n \ge 0$ to a probability distribution $p$ on $\{0,1\}$. We define
the shorthands $\mathrm{MDL}(x_{\le n}) := p$ (this is not the probability of sequence $x_{\le n}$!)
and $\mathrm{MDL}(x; x_{\le n}) := p(x)$. Model MDL assigns
$\ell(x_{\le n}; \mathrm{MDL}) := \sum_{1 \le k \le n} -\log \mathrm{MDL}(x_k; x_{<k})$ bits to
sequence $x_{\le n}$. We are now ready to formally define our model of interest.

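Definition 2.1. For sequence $\alpha_{1:\infty}$, where $0 < \alpha_1, \alpha_2, \ldots < 1$, and
probability distribution $p$, where $p(0), p(1) > 0$, we define model
$\mathrm{ESP} = (\alpha_{1:\infty}, p)$ by the sequential probability assignment rule

$$\mathrm{ESP}(x; x_{\le k}) := \begin{cases}
\alpha_k \, \mathrm{ESP}(x; x_{<k}) + 1 - \alpha_k, & \text{if } k > 0 \text{ and } x_k = x \\
\alpha_k \, \mathrm{ESP}(x; x_{<k}), & \text{if } k > 0 \text{ and } x_k \ne x \\
p(x), & \text{if } k = 0
\end{cases} \tag{1}$$
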

Smoothing rates control the adaption of ESP: large $\alpha_i$'s give high weight to old
observations and low weight to new observations, the converse holds for small $\alpha_i$'s. For
our analysis we must assume that the smoothing rates are sufficiently large:

Assumption 2.2. $\mathrm{ESP} = (\alpha_{1:\infty}, p)$ satisfies
$\frac{1}{2} < \alpha_1, \alpha_2, \ldots < 1$ and w.l.o.g. $p(0) \le p(1)$.

For the upcoming analysis the product of smoothing rates plays an important role. Hence,
given smoothing rate sequence $\alpha_{1:\infty}$ we define $\beta_0 := 1$ and
$\beta_i := \alpha_1 \cdot \ldots \cdot \alpha_i$, for $i > 0$.
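The update rule of ESP and the products $\beta_i$ can be sketched as follows. This is a minimal illustration under the assumption that, after observing letter $x_k$, every probability is multiplied by $\alpha_k$ and the probability of $x_k$ is then incremented by $1 - \alpha_k$ (as in Strategy 3). The names `esp_code_length` and `beta` are hypothetical helpers, not part of the paper.

```python
import math


def esp_code_length(bits, alphas, p1):
    """Code length in bits that ESP = (alphas, p) assigns to `bits`.

    p1 is the initial probability p(1); alphas[k - 1] plays the role of
    alpha_k, the smoothing rate applied after observing the k-th letter.
    """
    prob_one = p1
    total = 0.0
    for k, x in enumerate(bits, start=1):
        p_x = prob_one if x == 1 else 1.0 - prob_one
        total += -math.log2(p_x)          # ideal code length of letter x
        a = alphas[k - 1]
        # Scale both probabilities by a, then add 1 - a to the observed letter.
        prob_one = a * prob_one + (1.0 - a) * x
    return total


def beta(alphas, i):
    """beta_i = alpha_1 * ... * alpha_i, with beta_0 = 1."""
    return math.prod(alphas[:i])


if __name__ == "__main__":
    rates = [0.8, 0.8, 0.8]
    print(esp_code_length([0, 0, 1], rates, p1=0.6))
    print([beta(rates, i) for i in range(4)])
```

Only the current probability of a 1-bit has to be stored, so each update costs a constant number of arithmetic operations.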


3 Redundancy Analysis 


First Main Result. Now we can state our first main result which compares ESP to the code
length of an optimal fixed code for $x_{1:n}$, that is the empirical entropy $h(x_{1:n})$. Before
we prove the theorem, we discuss its implications.

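Theorem 3.1. If Assumption 2.2 holds, then we have

$$\ell(x_{1:n}; \mathrm{ESP}) - h(x_{1:n}) \le \begin{cases}
\sum_{i=0}^{n-1} \log \frac{1}{1 - \beta_i p(1)}, & \text{if } x_{1:n} \text{ is deterministic} \\
\log \frac{1}{\beta_{n-1} p(1)} + \sum_{i=0}^{n-2} \log \frac{1}{1 - \beta_i p(1)} - nH\left(\frac{1}{n}\right), & \text{otherwise}
\end{cases} \tag{2}$$
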
Recall that by Assumption 2.2 we have $p(0) \le p(1)$. First, consider a deterministic sequence
of 0-bits. By (1) we have $\mathrm{ESP}(1; x_{\le i}) = \beta_i p(1)$, thus
$\mathrm{ESP}(0; x_{\le i}) = 1 - \beta_i p(1)$, both for $0 \le i < n$. The total code length is
$\ell(x_{1:n}; \mathrm{ESP}) = -\sum_{i=0}^{n-1} \log(1 - \beta_i p(1))$ and clearly
$h(x_{1:n}) = 0$, so the redundancy $\ell(x_{1:n}; \mathrm{ESP}) - h(x_{1:n})$ matches (2). Now
consider a non-deterministic sequence $x_{1:n} = 00 \ldots 01$ with single 1-bit at position $n$.
Similar to the deterministic case the same equations for $\mathrm{ESP}(\,\cdot\,; x_{\le i})$ hold,
for $i < n$. The total code length is
$\ell(x_{1:n}; \mathrm{ESP}) = -\sum_{i=0}^{n-2} \log(1 - \beta_i p(1)) - \log(\beta_{n-1} p(1))$
and the empirical entropy is $h(x_{1:n}) = nH\left(\frac{1}{n}\right)$. Again, the redundancy
matches (2). In summary, if $p(0) \le p(1)$, then $00 \ldots 0$ is a deterministic sequence with
maximum redundancy and $00 \ldots 01$ is a non-deterministic sequence with maximum redundancy.
Similar statements hold, if $p(0) > p(1)$: by symmetry we must toggle 0-bits and 1-bits. When
$p(0) = p(1)$ we have equal redundancy, e.g.
$\ell(00 \ldots 0; \mathrm{ESP}) = \ell(11 \ldots 1; \mathrm{ESP})$ in the deterministic case. In
summary, for a given instance of ESP (that satisfies Assumption 2.2) the worst-case input is
either $00 \ldots 0$ (only 0-bits) or $00 \ldots 01$ (single 1-bit), among all $2^n$ bit sequences
of length $n$. For fixed $n$ we can now easily compare the redundancies of those two inputs and
immediately depict the worst-case input and its redundancy.
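The claim about the two worst-case inputs can be checked by brute force for small $n$. The sketch below assumes a fixed smoothing rate and the illustrative values $n = 8$, $\alpha = 0.8$, $p(1) = 0.6$ (so that $p(0) \le p(1)$ and $\alpha > \frac{1}{2}$); these values and the helper names are chosen here for illustration only.

```python
import itertools
import math


def entropy_bits(q):
    """Binary entropy H(q) in bits, with H(0) = H(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def redundancy(bits, alpha, p1):
    """ESP code length (fixed rate alpha) minus the empirical entropy."""
    prob_one, length = p1, 0.0
    for x in bits:
        length -= math.log2(prob_one if x == 1 else 1.0 - prob_one)
        prob_one = alpha * prob_one + (1.0 - alpha) * x
    n = len(bits)
    return length - n * entropy_bits(sum(bits) / n)


def worst_case(n, alpha, p1):
    """Exhaustive search over all 2^n bit sequences of length n."""
    return max(itertools.product((0, 1), repeat=n),
               key=lambda x: redundancy(x, alpha, p1))


if __name__ == "__main__":
    n, alpha, p1 = 8, 0.8, 0.6
    x_worst = worst_case(n, alpha, p1)
    all_zeros = (0,) * n
    single_one = (0,) * (n - 1) + (1,)
    print("worst input found:", x_worst, redundancy(x_worst, alpha, p1))
    print("00...0 :", redundancy(all_zeros, alpha, p1))
    print("00...01:", redundancy(single_one, alpha, p1))
```

The exhaustive search is exponential in $n$ and only meant as a sanity check; the theorem makes it unnecessary, since comparing the two candidate inputs suffices.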

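For the proof of Theorem 3.1 we require the following lemma.

Lemma 3.2. Any non-deterministic sequence $x_{1:n}$ of length $n \ge 2$ satisfies

$$h(x_{1:n}) - h(x_{2:n}) \ge \begin{cases}
nH\left(\frac{1}{n}\right), & \text{if } x_{2:n} \text{ is deterministic} \\
nH\left(\frac{1}{n}\right) - (n-1) H\left(\frac{1}{n-1}\right), & \text{otherwise}
\end{cases}$$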


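Proof. Let $1 - p$ be the relative frequency of $x_1$ in $x_{1:n}$, thus
$h(x_{1:n}) - h(x_{2:n}) = nH(p) - (n-1) H\left(\frac{n}{n-1} \cdot p\right) =: f(p)$. We
distinguish two cases:

Case 1: $x_{2:n}$ is deterministic. We have $p = \frac{n-1}{n}$ and
$f(p) = nH\left(\frac{n-1}{n}\right) = nH\left(\frac{1}{n}\right)$.

Case 2: $x_{2:n}$ is non-deterministic. Since $H(p)$ is concave, $H'(p)$ is decreasing and
$f'(p) = n \left[ H'(p) - H'\left(\frac{n}{n-1} \cdot p\right) \right] > 0$, i.e. $f(p)$ is
increasing and minimal for minimum $p$. Since $x_{1:n}$ is non-deterministic the minimum value of
$p$ is $\frac{1}{n}$ and we get
$f(p) \ge f\left(\frac{1}{n}\right) = nH\left(\frac{1}{n}\right) - (n-1) H\left(\frac{1}{n-1}\right)$. □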

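Now let us proceed with the major piece of work in this section.

Proof of Theorem 3.1. We define
$r(x_{1:n}, \mathrm{ESP}) := \ell(x_{1:n}; \mathrm{ESP}) - h(x_{1:n})$ and distinguish:

Case 1: $x_{1:n}$ is deterministic. By $p(0) \le p(1)$ (Assumption 2.2) we have
$\mathrm{ESP}(x_i; x_{<i}) \ge \mathrm{ESP}(0; 0^{i-1}) = 1 - p(1) \beta_{i-1}$, where $0^{i-1}$
denotes $i-1$ consecutive 0-bits, and $h(x_{1:n}) = 0$, we get

$$r(x_{1:n}, \mathrm{ESP}) = \sum_{1 \le i \le n} \log \frac{1}{\mathrm{ESP}(x_i; x_{<i})}
\le \sum_{0 \le i < n} \log \frac{1}{1 - p(1) \beta_i}.$$

Case 2: $x_{1:n}$ is non-deterministic. We have $n \ge 2$ and by induction on $n$ we prove

$$r(x_{1:n}, \mathrm{ESP}) \le \log \frac{1}{p(1) \beta_{n-1}}
+ \sum_{0 \le i < n-1} \log \frac{1}{1 - p(1) \beta_i} - nH\left(\frac{1}{n}\right).$$

Base: $n = 2$. We have $x_{1:n} \in \{01, 10\}$, in either case
$h(x_{1:n}) = nH\left(\frac{1}{n}\right) = 2$ and
$\ell(x_{1:n}; \mathrm{ESP}) = \log \frac{1}{p(0) p(1) \alpha_1}
= \log \frac{1}{p(1) \beta_1} + \log \frac{1}{1 - p(1) \beta_0}$, the claim follows.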















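Step: $n > 2$. By defining $\mathrm{ESP}' = (\alpha'_{1:\infty}, p')$, where
$\alpha'_i = \alpha_{i+1}$, $\beta'_i = \alpha'_1 \cdot \ldots \cdot \alpha'_i$ and
$p' = \mathrm{ESP}(x_{\le 1})$, we may write

$$r(x_{1:n}, \mathrm{ESP}) = \log \frac{1}{p(x_1)} + r(x_{2:n}, \mathrm{ESP}')
- \left( h(x_{1:n}) - h(x_{2:n}) \right). \tag{3}$$

Now w.l.o.g. fix $p'$ s.t. $p'(0) \le p'(1)$. Since we want to bound (3) from above, we must
choose $x_1$ s.t. $p(x_1)$ is minimal (and the r.h.s. of (3) is maximal). To do so, distinguish: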


Case 1: $x_1 = 0$. For some distribution $q$ with $q(0) > 0$ we have $p(x_1) = q(0)$ and
$\frac{1}{2} \ge p'(0) = \alpha_1 q(0) + 1 - \alpha_1$, thus
$q(0) \le \left[ \alpha_1 - \frac{1}{2} \right] / \alpha_1$. (Notice the subtle detail:
$\alpha_1 < \frac{1}{2}$ implies $q(0) < 0$, which contradicts $q(0) > 0$ and would make Case 1
impossible; however we assumed $\alpha_1 > \frac{1}{2}$.) Furthermore, we have $q(0) < \frac{1}{2}$.

Case 2: $x_1 = 1$. For some distribution $r$ with $r(1) > 0$ we have $p(x_1) = r(1)$ and
$\frac{1}{2} \le p'(1) = \alpha_1 r(1) + 1 - \alpha_1$, thus
$r(1) \ge \left[ \alpha_1 - \frac{1}{2} \right] / \alpha_1$.

Since $q(0) \le r(1)$ (i.e. Case 1 minimizes $p(x_1)$) and $q(0) < \frac{1}{2}$ we may now w.l.o.g.
assume that $x_1 = 0$, $p'(1) = \alpha_1 p(1)$, $p(x_1) = 1 - p(1)$ and $p(0) \le p(1)$. We
distinguish:


Case 1: $x_{2:n}$ is deterministic. We must have $x_{2:n} = 11 \ldots 1$, since $x_1 = 0$ and
$x_{1:n}$ is non-deterministic, thus

$$r(x_{2:n}, \mathrm{ESP}') = \sum_{0 \le i < n-1} \log \frac{1}{1 - p'(0) \beta'_i}
\le \log \frac{1}{p(1) \beta_{n-1}} + \sum_{1 \le i < n-1} \log \frac{1}{1 - p(1) \beta_i}, \tag{4}$$

where we obtain the inequality by $p'(0) \beta'_i \le p'(1) \beta'_i = p(1) \beta_{i+1}$, for
$i < n-2$, and $1 - p'(0) \beta'_{n-2} = 1 - [1 - p(1) \alpha_1] \beta_{n-1} / \alpha_1 \ge
p(1) \beta_{n-1}$, for $i = n-2$. To obtain the claim we plug the inequalities (4) and
$h(x_{1:n}) - h(x_{2:n}) \ge nH\left(\frac{1}{n}\right)$ (by Lemma 3.2) into (3) and note that
$p(x_1) = 1 - p(1) \beta_0$ (since $\beta_0 = 1$).


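Case 2: $x_{2:n}$ is non-deterministic. The induction hypothesis and
$p'(1) \beta'_i = p(1) \beta_{i+1}$ yield

$$r(x_{2:n}, \mathrm{ESP}') \le \log \frac{1}{p'(1) \beta'_{n-2}}
+ \sum_{0 \le i < n-2} \log \frac{1}{1 - p'(1) \beta'_i} - (n-1) H\left(\frac{1}{n-1}\right)
= \log \frac{1}{p(1) \beta_{n-1}} + \sum_{1 \le i < n-1} \log \frac{1}{1 - p(1) \beta_i}
- (n-1) H\left(\frac{1}{n-1}\right). \tag{5}$$

We plug the inequalities (5) and
$h(x_{1:n}) - h(x_{2:n}) \ge nH\left(\frac{1}{n}\right) - (n-1) H\left(\frac{1}{n-1}\right)$
(by Lemma 3.2) into (3) and note that $p(x_1) = 1 - p(1) \beta_0$ (since $\beta_0 = 1$) to end
the proof.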


Second Main Result. Let us now extend the competing scheme of Theorem 3.1 to which we
compare ESP. Suppose the competing scheme splits the input sequence $x_{1:n}$ according to an
arbitrary partition $S$ of $(0, n]$ and may use an optimal fixed code within every segment
$(a, b] \in S$. The competing scheme has total coding cost $h(x_{a+1:b})$ for $x_{a+1:b}$, thus
coding cost $\sum_{(a,b] \in S} h(x_{a+1:b})$ for $x_{1:n}$. Notice that this is a lower bound on
the coding cost of any PWS with partition $S$. Since the situation within a segment resembles the
situation of Theorem 3.1 we may now naturally extend the redundancy analysis to the
aforementioned competitor.

Theorem 3.3. Let $S$ be an arbitrary partition of $(0, n]$. If Assumption 2.2 holds, then

$$\ell(x_{1:n}; \mathrm{ESP}) - \sum_{(a,b] \in S} h(x_{a+1:b}) \le |S| \log \frac{1}{p(0) \beta_{n-1}}
+ \sum_{(a,b] \in S} \sum_{a < i < b} \log \frac{1}{1 - \beta_i / \beta_a}. \tag{6}$$

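Proof. Let $r(x_{1:n}, \mathrm{ESP}) := \ell(x_{1:n}; \mathrm{ESP}) - h(x_{1:n})$. Our plan for the
proof is to simplify (2) (see calculations below) to yield

$$\ell(x_{1:n}; \mathrm{ESP}) - h(x_{1:n}) \le \log \frac{1}{p(0) \beta_{n-1}}
+ \sum_{1 \le i < n} \log \frac{1}{1 - \beta_i} \tag{7}$$

and to use (7) to bound the redundancy for an arbitrary segment $(a, b]$ from $S$ (see
calculations below) via

$$\sum_{a < i \le b} \log \frac{1}{\mathrm{ESP}(x_i; x_{<i})} - h(x_{a+1:b})
\le \log \frac{1}{p(0) \beta_{b-1}} + \sum_{a < i < b} \log \frac{1}{1 - \beta_i / \beta_a}. \tag{8}$$

We now obtain (6) easily by summing (8) over all segments $(a, b]$ from $S$ and by
$\beta_{b-1} \ge \beta_{n-1}$.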

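Simplifying (2). Observe that
$\sum_{0 \le i < n} \log \frac{1}{1 - p(1) \beta_i} = \log \frac{1}{p(0)}
+ \sum_{1 \le i < n} \log \frac{1}{1 - p(1) \beta_i}$ and furthermore $p(0) \le p(1)$. So bound
(2) becomes

$$r(x_{1:n}, \mathrm{ESP}) \le \log \frac{1}{p(0)} + \sum_{1 \le i < n} \log \frac{1}{1 - p(1) \beta_i}
\le \log \frac{1}{p(0) \beta_{n-1}} + \sum_{1 \le i < n} \log \frac{1}{1 - \beta_i},$$

if $x_{1:n}$ is deterministic and by $\log \frac{1}{p(1)} - nH\left(\frac{1}{n}\right) \le 0$
(since $p(1) \ge \frac{1}{2}$ and $n \ge 2$)

$$r(x_{1:n}, \mathrm{ESP}) \le \log \frac{1}{p(0) p(1) \beta_{n-1}}
+ \sum_{1 \le i < n-1} \log \frac{1}{1 - p(1) \beta_i} - nH\left(\frac{1}{n}\right)
\le \log \frac{1}{p(0) \beta_{n-1}} + \sum_{1 \le i < n} \log \frac{1}{1 - \beta_i},$$

if $x_{1:n}$ is non-deterministic. In either case bound (7) holds.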

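Redundancy of $(a, b]$. For segment $(a, b]$ we define sequence $x'_{1:b-a} = x_{a+1:b}$ and
$\mathrm{ESP}' = (\alpha'_{1:\infty}, p')$, s.t.
$\mathrm{ESP}(x; x_{<i}) = \mathrm{ESP}'(x; x'_{<i-a})$ for $i \in (a, b]$. Therefore, let
$\alpha'_{1:\infty} = \alpha_{a+1:\infty}$, $\beta'_i = \alpha'_1 \cdot \ldots \cdot \alpha'_i$ and
w.l.o.g. $p'(0) \le p'(1)$. We obtain

$$\sum_{a < i \le b} \log \frac{1}{\mathrm{ESP}(x_i; x_{<i})} - h(x_{a+1:b})
= \sum_{1 \le i \le b-a} \log \frac{1}{\mathrm{ESP}'(x'_i; x'_{<i})} - h(x'_{1:b-a})$$
$$\overset{(7)}{\le} \log \frac{1}{p'(0) \beta'_{b-a-1}} + \sum_{1 \le i < b-a} \log \frac{1}{1 - \beta'_i}
\le \log \frac{1}{p(0) \beta_{b-1}} + \sum_{a < i < b} \log \frac{1}{1 - \beta_i / \beta_a},$$

where the last step is due to $p'(0) \ge p(0) \beta_a$ (also $p'(1) \ge p(0) \beta_a$) and
$\beta'_i = \beta_{a+i} / \beta_a$. □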


4 Choice of Smoothing Rate Sequence 

Fixed Smoothing Rate. A straightforward choice for the smoothing rates is to use the same
rate $\alpha$ in every step. This leads to a simple and fast implementation, since no smoothing
rate sequence needs to be computed or stored. We require the following lemma for the analysis:

Lemma 4.1. For $0 < \alpha < 1$ we have
$\sum_{1 \le i \le m} \log \frac{1}{1 - \alpha^i} \le \frac{(\pi \log e)^2}{6 \log \frac{1}{\alpha}}$.

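Proof. For $m = 0$ the bound trivially holds, let $m \ge 1$. Since $\log \frac{1}{1 - \alpha^z}$
is decreasing in $z$ and integrable for $z$ in $[0, \infty)$ we may bound the series by an
integral,

$$\sum_{1 \le i \le m} \log \frac{1}{1 - \alpha^i} \le \int_0^\infty \log \frac{1}{1 - \alpha^z} \, dz
= \log(e) \sum_{j \ge 1} \frac{1}{j} \int_0^\infty \alpha^{jz} \, dz. \tag{9}$$

The equality in (9) follows from the series expansion $\ln \frac{1}{1 - y} = \sum_{j \ge 1} y^j / j$,
for $|y| < 1$. To end the proof, it remains to bound the integral in (9) as follows (notice
$\sum_{j \ge 1} j^{-2} = \pi^2 / 6$):

$$\sum_{j \ge 1} \frac{1}{j} \int_0^\infty \alpha^{jz} \, dz
= \frac{\log e}{\log \frac{1}{\alpha}} \sum_{j \ge 1} \frac{1}{j^2}
= \frac{\pi^2 \log e}{6 \log \frac{1}{\alpha}}. \qquad □$$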
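The bound of Lemma 4.1 is easy to check numerically. The following sketch compares the partial sums with the closed-form upper bound $(\pi \log e)^2 / (6 \log \frac{1}{\alpha})$; the function names and the tested values of $\alpha$ and $m$ are chosen here for illustration.

```python
import math


def lemma_lhs(alpha, m):
    """Sum over 1 <= i <= m of log2(1 / (1 - alpha^i))."""
    return sum(-math.log2(1.0 - alpha ** i) for i in range(1, m + 1))


def lemma_rhs(alpha):
    """(pi * log2(e))^2 / (6 * log2(1 / alpha))."""
    return (math.pi * math.log2(math.e)) ** 2 / (6.0 * math.log2(1.0 / alpha))


if __name__ == "__main__":
    for alpha in (0.55, 0.9, 0.99):
        for m in (1, 10, 2000):
            print(alpha, m, lemma_lhs(alpha, m), lemma_rhs(alpha))
```

The right-hand side does not depend on $m$, which is what later allows the same estimate to be applied to every segment regardless of its length.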


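Corollary 4.2. Let $S$ be an arbitrary partition of $(0, n]$. If
$\alpha = \alpha_1 = \alpha_2 = \ldots$ and Assumption 2.2 holds, then

$$\ell(x_{1:n}; \mathrm{ESP}) - \sum_{(a,b] \in S} h(x_{a+1:b}) \le |S| \left[ \log \frac{1}{p(0)}
+ \frac{(\pi \log e)^2}{6 \log \frac{1}{\alpha}} + (n-1) \log \frac{1}{\alpha} \right]. \tag{10}$$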


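Proof. We have $\beta_i = \alpha^i$, thus for $i \in (a, b]$ we plug the estimate

$$\sum_{a < i < b} \log \frac{1}{1 - \beta_i / \beta_a}
= \sum_{0 < i-a < b-a} \log \frac{1}{1 - \alpha^{i-a}}
\overset{\text{Lem. 4.1}}{\le} \frac{(\pi \log e)^2}{6 \log \frac{1}{\alpha}}$$

and $\log \frac{1}{\beta_{n-1}} = (n-1) \log \frac{1}{\alpha}$ into (6) to conclude the proof. □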


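Choosing $\alpha = e^{-\pi / \sqrt{6(n-1)}}$ minimizes the r.h.s. of bound (10) and satisfies
$\alpha > \frac{1}{2}$ (Assumption 2.2), when $n \ge 5$. The optimal choice gives redundancy at most

$$|S| \left[ \frac{2 \pi \log e}{\sqrt{6}} \cdot \sqrt{n-1} + \log \frac{1}{p(0)} \right]
< |S| \left[ 3.701 \cdot \sqrt{n} + \log \frac{1}{p(0)} \right]. \tag{11}$$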
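The optimal fixed rate and the resulting constant can be reproduced numerically. The sketch below assumes bound (10) has the form $|S| \left[ \log \frac{1}{p(0)} + \frac{(\pi \log e)^2}{6 \log (1/\alpha)} + (n-1) \log \frac{1}{\alpha} \right]$; the function names and the sample values are illustrative and not taken from the paper.

```python
import math

LOG2_E = math.log2(math.e)


def optimal_fixed_alpha(n):
    """alpha = exp(-pi / sqrt(6 (n - 1))), the minimizer of bound (10)."""
    return math.exp(-math.pi / math.sqrt(6.0 * (n - 1)))


def bound_fixed_rate(alpha, n, p0, num_segments):
    """Right-hand side of bound (10) for a fixed smoothing rate alpha."""
    log_inv_alpha = -math.log2(alpha)
    per_segment = (math.log2(1.0 / p0)
                   + (math.pi * LOG2_E) ** 2 / (6.0 * log_inv_alpha)
                   + (n - 1) * log_inv_alpha)
    return num_segments * per_segment


def bound_optimal_rate(n, p0, num_segments):
    """Left expression of (11): bound (10) evaluated at the optimal alpha."""
    constant = 2.0 * math.pi * LOG2_E / math.sqrt(6.0)
    return num_segments * (constant * math.sqrt(n - 1) + math.log2(1.0 / p0))


if __name__ == "__main__":
    n, p0, s = 1000, 0.05, 3
    a = optimal_fixed_alpha(n)
    print("alpha*            :", a)
    print("bound (10) at a*  :", bound_fixed_rate(a, n, p0, s))
    print("bound (11)        :", bound_optimal_rate(n, p0, s))
    print("2 pi log(e)/sqrt 6:", 2.0 * math.pi * LOG2_E / math.sqrt(6.0))
```

Evaluating `bound_fixed_rate` at rates slightly above or below `optimal_fixed_alpha(n)` gives a larger value, which illustrates the trade-off between the term growing with $1 / \log \frac{1}{\alpha}$ and the term growing with $(n-1) \log \frac{1}{\alpha}$.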


Varying Smoothing Rate. It is impossible to choose an optimal fixed smoothing rate, when
$n$ is unknown. A standard technique to handle this situation is the doubling trick, which will
increase the $\sqrt{n}$-term in (11) by a factor of $\sqrt{2} / (\sqrt{2} - 1) \approx 3.41$.
However, we can do better by slowly increasing the smoothing rate step-by-step, which only leads
to a factor $\sqrt{2} \approx 1.41$.

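Corollary 4.3. Let $S$ be an arbitrary partition of $(0, n]$. If
$\alpha_k = e^{-\pi / \sqrt{12(k+1)}}$ (i.e. $\alpha_k > \frac{1}{2}$) and Assumption 2.2 holds,
then

$$\ell(x_{1:n}; \mathrm{ESP}) - \sum_{(a,b] \in S} h(x_{a+1:b}) \le |S| \left[ \log \frac{1}{p(0)}
+ \frac{2 \pi \log e}{\sqrt{3}} \cdot \sqrt{n} \right]. \tag{12}$$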


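Proof. We have
$\beta_i = \exp\left[ -\frac{\pi}{\sqrt{12}} \sum_{1 < k \le i+1} k^{-1/2} \right]$ and bound the
terms depending on the $\beta_i$'s in (6) from above. First, observe that

$$\sum_{1 < k \le n} \frac{1}{\sqrt{k}} < \int_1^n \frac{dz}{\sqrt{z}} < 2 \sqrt{n}
\implies \log \frac{1}{\beta_{n-1}} < \frac{\pi \log e}{\sqrt{3}} \cdot \sqrt{n}; \tag{13}$$

second, for $a < i < b$ we have
$\beta_i / \beta_a = \alpha_{a+1} \cdot \ldots \cdot \alpha_i \le (\alpha_{n-1})^{i-a}$, since
$i < n$ and $\alpha_1, \alpha_2, \ldots$ is increasing, consequently we obtain

$$\sum_{a < i < b} \log \frac{1}{1 - \beta_i / \beta_a}
\le \sum_{a < i < b} \log \frac{1}{1 - (\alpha_{n-1})^{i-a}}
\overset{\text{Lem. 4.1}}{\le} \frac{(\pi \log e)^2}{6 \log \frac{1}{\alpha_{n-1}}}
= \frac{\pi \log e}{\sqrt{3}} \cdot \sqrt{n}. \tag{14}$$

We plug (13) and (14) into (6), the result is (12). □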


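Count Smoothing. Consider aging Strategy 2 from Section 1 with smoothing rate
$\lambda \in (0,1)$. We will now show that Strategy 2 is an instance of ESP. For $s_0, s_1 > 0$
we define the smoothed count $s(x; x_{\le k})$ of bit $x$ and the smoothed total count $t_k$ as
follows

$$s(x; x_{\le k}) := \begin{cases}
\lambda s(x; x_{<k}) + 1, & \text{if } k > 0 \text{ and } x_k = x \\
\lambda s(x; x_{<k}), & \text{if } k > 0 \text{ and } x_k \ne x \\
s_x, & \text{if } k = 0
\end{cases}
\quad \text{and} \quad
t_k := \begin{cases}
\lambda t_{k-1} + 1, & \text{if } k > 0 \\
s_0 + s_1, & \text{if } k = 0
\end{cases}$$

Strategy 2 predicts $p(x; x_{\le k}) = s(x; x_{\le k}) / t_k$. In case $x_k = x$ we get

$$p(x; x_{\le k}) = \frac{\lambda s(x; x_{<k}) + 1}{t_k}
= \frac{\lambda t_{k-1}}{t_k} \cdot \frac{s(x; x_{<k})}{t_{k-1}} + \frac{1}{t_k}
= \frac{\lambda t_{k-1}}{t_k} \cdot p(x; x_{<k}) + \frac{1}{t_k},$$

similarly $p(x; x_{\le k}) = \frac{\lambda t_{k-1}}{t_k} \cdot p(x; x_{<k})$, if $x_k \ne x$. If we
now choose $\alpha_k = \frac{\lambda t_{k-1}}{t_k}$ and $p(x) = \frac{s_x}{s_0 + s_1}$, the above
sequential probability assignment rule resembles (1). This insight allows us to adopt our
analysis method. To do so, we require the following technical statement first.

Lemma 4.4. For $1 \le a \le b$ and $0 < \lambda < 1$ we have
$\frac{1 - \lambda^a}{1 - \lambda^b} \ge \frac{a}{b}$.

Proof. Let $f(z) := \ln((1 - \lambda^z) / z)$, it suffices to prove that $f(a) \ge f(b)$. By
$\ln \lambda^z \ge 1 - 1 / \lambda^z$ we get
$f'(z) = \left[ (1 - \ln \lambda^z) \cdot \lambda^z - 1 \right] / \left[ z (1 - \lambda^z) \right] \le 0$,
so $f$ is decreasing. □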
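The equivalence between count smoothing and ESP with a varying smoothing rate can be illustrated as follows. The sketch assumes that Strategy 2 multiplies both counts by $\lambda$ and then increments the count of the observed bit, and that ESP uses $\alpha_k = \lambda t_{k-1} / t_k$ with $t_k = \lambda t_{k-1} + 1$ and $t_0 = s_0 + s_1$. The function names and the sample input are hypothetical.

```python
def count_smoothing_probs(bits, lam, s0, s1):
    """P(1) predicted before each letter by smoothed counts (Strategy 2)."""
    c0, c1 = s0, s1
    out = []
    for x in bits:
        out.append(c1 / (c0 + c1))
        c0 *= lam                  # age both counts ...
        c1 *= lam
        if x == 1:                 # ... then count the novel letter
            c1 += 1.0
        else:
            c0 += 1.0
    return out


def esp_probs(bits, lam, s0, s1):
    """P(1) predicted by ESP with alpha_k = lam * t_{k-1} / t_k."""
    t = s0 + s1                    # t_0
    p1 = s1 / t                    # initial distribution p(1)
    out = []
    for x in bits:
        out.append(p1)
        t_next = lam * t + 1.0     # t_k = lam * t_{k-1} + 1
        a = lam * t / t_next       # alpha_k, note that 1 - a = 1 / t_k
        p1 = a * p1 + (1.0 - a) * x
        t = t_next
    return out


if __name__ == "__main__":
    x = [0, 1, 1, 0, 1, 0, 0, 0]
    print(count_smoothing_probs(x, 0.9, 0.5, 0.5))
    print(esp_probs(x, 0.9, 0.5, 0.5))
```

Both functions produce the same predictions up to floating-point rounding, while the ESP variant stores a single probability and the total count instead of one count per letter.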


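Corollary 4.5. Let $S$ be an arbitrary partition of $(0, n]$. Fix $0 < \lambda < 1$ and
$m \ge 1$, define $t_k := \lambda t_{k-1} + 1$ for $k \ge 1$ and
$t_k = 1 + \lambda + \cdots + \lambda^{m-1}$ for $k = 0$. If
$\alpha_k = \frac{\lambda t_{k-1}}{t_k}$ and Assumption 2.2 holds, then

$$\ell(x_{1:n}; \mathrm{ESP}) - \sum_{(a,b] \in S} h(x_{a+1:b}) \le |S| \left[ \log \frac{n}{p(0)}
+ \frac{(\pi \log e)^2}{6 \log \frac{1}{\lambda}} + (n-1) \log \frac{1}{\lambda} \right]. \tag{15}$$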


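Proof. Let $k \ge 1$ and note that by $t_k = \lambda t_{k-1} + 1$ we may write
$\alpha_k = \lambda t_{k-1} / t_k$ and
$t_k = 1 + \lambda + \cdots + \lambda^{m+k-1} = (1 - \lambda^{m+k}) / (1 - \lambda)$ and get

$$\beta_i = \alpha_1 \cdot \ldots \cdot \alpha_i
= \frac{\lambda t_0}{t_1} \cdot \frac{\lambda t_1}{t_2} \cdot \ldots \cdot \frac{\lambda t_{i-1}}{t_i}
= \frac{t_0}{t_i} \cdot \lambda^i = \frac{1 - \lambda^m}{1 - \lambda^{m+i}} \cdot \lambda^i.$$

We now proceed by bounding the terms dependent on $\beta_i$ in (6):

$$\beta_i = \frac{(1 - \lambda^m) \lambda^i}{1 - \lambda^{m+i}}
\overset{\text{Lem. 4.4}}{\ge} \frac{m \lambda^i}{m + i} \ge \frac{\lambda^i}{i + 1}
\quad \text{and} \quad
\frac{\beta_i}{\beta_a} = \frac{1 - \lambda^{m+a}}{1 - \lambda^{m+i}} \cdot \lambda^{i-a}
\overset{a < i}{\le} \lambda^{i-a}.$$

From the above inequalities we obtain

$$\log \frac{1}{\beta_{n-1}} \le \log \frac{n}{\lambda^{n-1}}
\quad \text{and} \quad
\sum_{a < i < b} \log \frac{1}{1 - \beta_i / \beta_a}
\le \sum_{a < i < b} \log \frac{1}{1 - \lambda^{i-a}}
\overset{\text{Lem. 4.1}}{\le} \frac{(\pi \log e)^2}{6 \log \frac{1}{\lambda}}.$$

Finally we plug the above inequalities into (6) and rearranging yields (15). □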


For $k \to \infty$ we have $t_k \to \frac{1}{1 - \lambda}$, thus $\alpha_k \to \lambda$, i.e. we
expect the smoothed counts method to perform similar to ESP with fixed smoothing rate $\lambda$,
when the input is large enough. Bound (15) reflects this behavior, it differs from (10) only by
the additive term $|S| \log n$. Furthermore, the optimal value of $\lambda$ in (15) matches the
optimal value of $\alpha$ in (10).

























5 Experiments 


For inputs of length $n$ we experimentally checked the tightness of our bounds from the
previous section for a wide range of ESP-instances with smoothing rate choices (i) fixed
“optimal” smoothing rate $\alpha = \exp(-\pi / \sqrt{6(n-1)})$ (here “optimal” means that the
corresponding bound, cf. Corollary 4.2, is minimized), (ii) varying smoothing from Corollary 4.3
and (iii) varying smoothing from Corollary 4.5 with “optimal”
$\lambda = \exp(-\pi / \sqrt{6(n-1)})$ and $m = 1$. Since our bounds from Corollaries 4.2, 4.3
and 4.5 are worst-case bounds we compare them to the empirically measured (approximate)
worst-case redundancy. Furthermore, we compare the (approximate) worst-case redundancy of (i),
(ii) and (iii) to each other. We now explain the details below.


Experimental Setup. In the following let smoothing rate sequence $\alpha_{1:\infty}$, input
length $n = 1000$, partition $S = \{(0, 200], (200, 700], (700, 1000]\}$ and
$\varepsilon = 0.05$ be fixed. (We inspected the outcome of our experiments for different
parameters and got similar results, hence these values.) We want to judge our bounds on a wide
range of ESP-instances, in particular we choose class
$C = \{(\alpha_{1:\infty}, p) \mid 0 < \varepsilon \le p(0), p(1)\}$ of ESP-instances. To do so, we
have to modify our bounds slightly: we must replace the term $p(0)$ by $\varepsilon$. For
instance, in Situation (i), we may bound the redundancy of any $\mathrm{ESP} \in C$ on prefix
$x_{1:k}$ of given $x_{1:n}$ as follows

$$\ell(x_{1:k}; \mathrm{ESP}) - \sum_{(a,b] \in S} h(x_{a+1:\min\{k,b\}}) \le |S| \left[
\frac{2 \pi \log e}{\sqrt{6}} \cdot \sqrt{k-1} + \log \frac{1}{\varepsilon} \right], \tag{16}$$

for $1 \le k \le n$. Since the resulting bounds remain worst-case bounds, we compare the
resulting bounds for situations (i)-(iii) to the worst-case redundancy

$$r(k) := \max_{\mathrm{ESP} \in C, \, x_{1:n}} \left[ \ell(x_{1:k}; \mathrm{ESP})
- \sum_{(a,b] \in S} h(x_{a+1:\min\{k,b\}}) \right]. \tag{17}$$

Unfortunately, computing the maximum is intractable, since $C$ is uncountably infinite and
there are exponentially many sequences $x_{1:n}$. To lift this limitation we take the maximum
over a finite subset of ESP-instances from $C$ and inputs $x_{1:n}$, specified as follows: For
numbers $q_0, \ldots, q_{|S|} \in \{0.05, \ldots, 0.95\}$ we consider pairs
$(\mathrm{ESP}, x_{1:n})$ s.t. $\mathrm{ESP}(0; \phi) = q_0$ ($q_0$ determines an ESP-instance)
and $x_{1:n}$ is drawn uniformly at random from all sequences where for the $i$-th segment
$(a, b] \in S$ subsequence $x_{a+1:b}$ has exactly $\lfloor q_i \cdot (b - a) \rfloor$ 1-bits
($q_i$ determines the (approximate) fraction of 1-bits in the $i$-th segment). We now take the
maximum in (17) over all combinations $(q_0, \ldots, q_{|S|})$ and repeat the random experiment
100 times for every combination $(q_0, \ldots, q_{|S|})$ (in total, the number of combinations
times 100 simulations). Figure 1 depicts the approximation of $r(k)$ (solid lines) and our bounds
on $\ell(x_{\le k}; \mathrm{ESP}) - \sum_{(a,b] \in S} h(x_{a+1:\min\{k,b\}})$ (dashed lines).
(For instance, bound (16) is depicted as dashed line in the left plot of Figure 1.)
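The random input generation described above can be sketched as follows. This is an illustrative reading of the setup, not the code used for the experiments: it assumes half-open segments $(a, b]$ of length $b - a$, given as a list of boundaries, and draws each segment uniformly among all arrangements with the prescribed number of 1-bits. The helper name `sample_sequence` and the small rounding guard are assumptions made here.

```python
import math
import random


def sample_sequence(boundaries, fractions, rng):
    """Draw a bit sequence with a fixed number of 1-bits per segment.

    boundaries: e.g. [0, 200, 700, 1000] for (0,200], (200,700], (700,1000].
    fractions:  one (approximate) fraction of 1-bits per segment.
    """
    bits = []
    for (a, b), q in zip(zip(boundaries, boundaries[1:]), fractions):
        length = b - a
        # floor(q * length); the tiny offset guards against float rounding.
        ones = math.floor(q * length + 1e-9)
        segment = [1] * ones + [0] * (length - ones)
        rng.shuffle(segment)       # uniform over all arrangements
        bits.extend(segment)
    return bits


if __name__ == "__main__":
    rng = random.Random(0)
    x = sample_sequence([0, 200, 700, 1000], (0.05, 0.5, 0.95), rng)
    print(len(x), sum(x[:200]), sum(x[200:700]), sum(x[700:]))
```

Repeating the draw for every combination of fractions and taking the maximum redundancy over all draws yields the approximation of $r(k)$ used in the comparison.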


Approximate Worst-Case Redundancy. We now compare (approximate) $r(k)$ for smoothing
rate choices (i)-(iii) and observe: On one hand, as long as $k$ is small, varying smoothing
rates, (ii) and (iii), yield lower redundancy than (i), and (iii) performs better than (ii). On the
other hand, when $k$ is large (i), (ii) and (iii) don't differ too much. The increase in redundancy
at $k = 201$ and $k = 701$ is nearly identical in all cases; the difference in redundancy is almost
entirely caused by segment $(0, 200]$.


Bounds Behavior. Now we compare the bounds to (approximate) $r(k)$. In general, the
tightness of our bounds decreases as the number of segments increases. This is plausible,
since we essentially concatenated the worst-case bound for $|S| = 1$. However, we don't
know whether or not the worst-case redundancy for $|S| = 1$ can appear in multiple adjacent
segments at the same time. Experiments indicate that this may not be the case. Furthermore,
in (i) the bound is tightest, especially within segment $(0, 200]$. In cases (ii) and (iii) the
bounds are looser. An explanation is that in the corresponding proofs we worked with rather
generous simplifications, e.g. when bounding $-\sum_{a < i < b} \log(1 - \beta_i / \beta_a)$ from
above. If we compare (ii) to (i) and to (iii) we can see that bound (ii) is tighter for very
small $k$. The reason is simple: Bound (ii) does not depend on a smoothing rate parameter,
whereas (i) contains the term $1 / \log \frac{1}{\alpha}$ and (iii) contains the term
$1 / \log \frac{1}{\lambda}$. These terms dominate the bounds when $k$ is small and $\alpha$ and
$\lambda$ are close to 1. (We have $\alpha = \lambda \approx 0.96$, since $\alpha$ and $\lambda$
were chosen to minimize the corresponding bound for $n = 1000$.)

Figure 1: Redundancy bound (dashed line) and approximate worst-case redundancy $r(k)$
(solid line) of class $\{(\alpha_{1:\infty}, p) \mid 0 < \varepsilon \le p(0), p(1)\}$ for
$\varepsilon = 0.05$ w.r.t. competitor with partition
$S = \{(0, 200], (200, 700], (700, 1000]\}$ on the length-$k$ prefix, $1 \le k \le n$, of a
sequence with length $n = 1000$. The x-axis is prefix length $k$ and the y-axis is redundancy in
bit. Every plot corresponds to a different smoothing rate choice: (i) fixed “optimal” smoothing
rate $\alpha = \exp(-\pi / \sqrt{6(n-1)})$, (ii) varying smoothing from Corollary 4.3 and (iii)
varying smoothing from Corollary 4.5 with “optimal” $\lambda = \exp(-\pi / \sqrt{6(n-1)})$ and
$m = 1$.


6 Conclusion 

In this work we analyzed a class of practical and adaptive elementary models which assign
probabilities by exponential smoothing, ESP. Our analysis is valid for a binary alphabet.
By choosing smoothing rates appropriately our strategy generalizes count smoothing
(Strategy 2) and probability smoothing from PAQ (Strategy 3). Due to its low memory footprint
and linear per-sequence time complexity ESP is attractive from a practical point of view. From a
theoretical point of view ESP is attractive as well: For various smoothing rate sequences ESP has
redundancy only $O(s\sqrt{n})$ above any PWS with $s$ segments, an improvement over previous
approaches. A short experimental study supports our bounds.

Nevertheless, experiments indicate that there is room for an improved analysis. Apart from
minor technical issues, a major approach would be to obtain redundancy bounds w.r.t. PWS
that take the similarity of adjacent segments into account. That is, if adjacent segments have
very similar distributions, the increase in redundancy should be small, compared to adjacent
segments with drastically different distributions. Furthermore, it is desirable to generalize
the analysis to a non-binary alphabet. We defer a thorough experimental study that compares
ESP to other adaptive elementary models to future research.